How Can an Old-Fashioned Math Nerd Take Advantage of ML Hype?

Alexander Chichigin

Disclaimer

THE AUTHOR SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

Old-Fashioned Math

Lorenz Attractor

Math vs. ML

ML

  • Deep Neural Networks
  • Recurrent Neural Networks
  • Residual Networks
  • Extreme Learning Machines

Math

  • Ordinary Differential Equations
  • Partial Differential Equations
  • Stochastic Differential Equations

Neural ODEs

Ordinary Differential Equations

\[ F(x, y, y', \dots, y^{(n-1)}) = y^{(n)} \]

\[ y' = F(x, y) \]

Vector field

Euler’s method

\[ y_{n+1} = y_n + h \times F(x_n, y_n) \]
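As a minimal NumPy sketch of this update rule (the right-hand side `F`, the step size `h`, and the toy problem below are illustrative choices, not part of the slides):

```python
import numpy as np

def euler_solve(F, x0, y0, h, n_steps):
    """Explicit Euler: y_{n+1} = y_n + h * F(x_n, y_n)."""
    xs, ys = [x0], [np.asarray(y0, dtype=float)]
    for _ in range(n_steps):
        ys.append(ys[-1] + h * F(xs[-1], ys[-1]))
        xs.append(xs[-1] + h)
    return np.array(xs), np.array(ys)

# Toy example: y' = -y, y(0) = 1, whose exact solution is exp(-x).
xs, ys = euler_solve(lambda x, y: -y, x0=0.0, y0=1.0, h=0.1, n_steps=50)
```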

Neural Networks

Feed-Forward Neural Network

Math View

\[ \overrightarrow{y}_{n+1} = F_n \left( \mathbf{W}_n\times \overrightarrow{y}_n + \overrightarrow{b}_n \right) \]

where

\[ F_n \in \left\{ \sigma, ~\mathrm{ReLU}, ~\dots \right\} \]
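A minimal sketch of this layer-by-layer view (the layer sizes, random parameters, and the tanh/ReLU choices below are arbitrary illustrations):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def feed_forward(y, weights, biases, activations):
    """Apply y_{n+1} = F_n(W_n @ y_n + b_n) for each layer n."""
    for W, b, F in zip(weights, biases, activations):
        y = F(W @ y + b)
    return y

# Two-layer toy network with random parameters.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(8, 4)), rng.normal(size=(2, 8))]
bs = [np.zeros(8), np.zeros(2)]
out = feed_forward(np.ones(4), Ws, bs, [np.tanh, relu])
```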

Residual Neural Networks

ResNet

Math View

\[ \overrightarrow{y}_{n+1} = \overrightarrow{y}_n + F_n \left( \mathbf{W}_n\times \overrightarrow{y}_n + \overrightarrow{b}_n \right) \]

\[ \overrightarrow{y}(t + dt) = \overrightarrow{y}(t) + dt \times G \left( \overrightarrow{y}(t), t, \theta \right) \]

\[ \frac{ d\overrightarrow{y}(t) }{dt} = G \left( \overrightarrow{y}(t), t, \theta \right) \]
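In code the analogy is a one-liner: a residual block is an explicit Euler step with step size 1 (the single tanh layer below stands in for an arbitrary G; a sketch, not a full ResNet):

```python
import numpy as np

def residual_block(y, W, b, dt=1.0):
    """y_{n+1} = y_n + dt * G(y_n); letting dt -> 0 gives dy/dt = G(y)."""
    return y + dt * np.tanh(W @ y + b)
```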

Automate!

\[ \frac{ d\mathbf{y}(t)}{dt} = G \left( \mathbf{y}(t), t, \theta \right) \]

\[ \mathbf{y}(t_1) = \mathtt{ODESolve}(\mathbf{y}(t_0), G, t_0, t_1, \theta) \]
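A sketch of that idea, assuming SciPy's general-purpose `solve_ivp` plays the role of `ODESolve` and a single tanh layer plays the role of G:

```python
import numpy as np
from scipy.integrate import solve_ivp

def G(y, t, theta):
    """A toy neural vector field: one dense layer, theta = (W, b)."""
    W, b = theta
    return np.tanh(W @ y + b)

def ode_solve(y0, G, t0, t1, theta):
    """y(t1) = ODESolve(y(t0), G, t0, t1, theta) via an adaptive Runge-Kutta scheme."""
    sol = solve_ivp(lambda t, y: G(y, t, theta), (t0, t1), y0)
    return sol.y[:, -1]

theta = (np.eye(3) * -0.5, np.zeros(3))
y1 = ode_solve(np.ones(3), G, 0.0, 1.0, theta)
```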

Train?

\[ L(\mathbf{y}(t_1)) = L(\mathtt{ODESolve}(\mathbf{y}(t_0), G, t_0, t_1, \theta)) \]

Back Propagation???

Automatic Differentiation FTW!!!1
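This is exactly what a framework like JAX provides: its `odeint` is differentiable end to end, so the loss above can be minimized by plain gradient descent on θ. A hedged sketch, where the tanh vector field and the squared-error loss are placeholder choices:

```python
import jax.numpy as jnp
from jax import grad
from jax.experimental.ode import odeint

def G(y, t, W, b):
    # Neural vector field dy/dt = G(y, t, theta) with theta = (W, b).
    return jnp.tanh(W @ y + b)

def loss(W, b, y0, target):
    ts = jnp.array([0.0, 1.0])
    y1 = odeint(G, y0, ts, W, b)[-1]     # ODESolve(y(t0), G, t0, t1, theta)
    return jnp.sum((y1 - target) ** 2)   # L(y(t1))

# Gradients w.r.t. the parameters flow straight through the solver.
dL_dW = grad(loss, argnums=0)
```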

Physics-Informed Neural Networks

ODEs Once More

\[ u' = f(u, x) \] and \[ u(0) = u_0 \]

Approximate Numerical Solution

\[ u(x) \approx NN_{\theta}(x) \]

Ideally \[ \frac{d NN_{\theta}}{dx}(x) = f(NN_{\theta}(x), x) \]

The Loss

\[ L(\theta) = \sum_i \left( \frac{d NN_{\theta}}{dx}(x_i) - f(NN_{\theta}(x_i), x_i) \right)^2 \]
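A minimal JAX sketch of this residual loss, assuming a one-hidden-layer scalar network (the name `nn` and the parameter shapes are illustrative):

```python
import jax.numpy as jnp
from jax import grad, vmap

def nn(theta, x):
    """Scalar surrogate u(x) ~ NN_theta(x); theta = (W1, b1, w2, b2)."""
    W1, b1, w2, b2 = theta
    h = jnp.tanh(W1 * x + b1)        # hidden layer acting on scalar x
    return jnp.dot(w2, h) + b2

def residual_loss(theta, xs, f):
    """Sum_i (d NN/dx (x_i) - f(NN(x_i), x_i))^2 over collocation points xs."""
    du_dx = vmap(grad(nn, argnums=1), in_axes=(None, 0))(theta, xs)
    u     = vmap(nn, in_axes=(None, 0))(theta, xs)
    return jnp.sum((du_dx - f(u, xs)) ** 2)
```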

Initial Condition?!

\[ g_{\theta}(x) = u_0 + x \times NN_{\theta}(x) \]

\[ L(\theta) = \sum_i \left( \frac{d g_{\theta}}{dx}(x_i) - f(g_{\theta}(x_i), x_i) \right)^2 \]
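Continuing the previous sketch (it reuses the toy `nn` defined above), wrapping the network in \(g_{\theta}\) bakes the initial condition in, so only the residual term remains:

```python
import jax.numpy as jnp
from jax import grad, vmap

def g(theta, x, u0):
    # g(0) = u0 by construction, so no separate initial-condition term is needed.
    return u0 + x * nn(theta, x)

def loss(theta, xs, f, u0):
    dg_dx = vmap(grad(g, argnums=1), in_axes=(None, 0, None))(theta, xs, u0)
    gs    = vmap(g, in_axes=(None, 0, None))(theta, xs, u0)
    return jnp.sum((dg_dx - f(gs, xs)) ** 2)
```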

That’s not a PINN!

\[ u_t + \mathcal{N}[u] = 0, x \in \Omega, t \in [0, T] \]

\[ f := u_t + \mathcal{N}[u] \]

\[ u(t, x) \approx NN_{\theta}(t, x) \]

Greater Loss

\[ MSE = MSE_u + MSE_f \] where

\[ MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} \left( u(t_u^i, x_u^i) - u^i \right)^2 \] and

\[ MSE_f = \frac{1}{N_f} \sum_{i=1}^{N_f} \left( f(t_f^i, x_f^i) \right)^2 \]
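Concretely, a hedged sketch of the two-term loss, taking Burgers’ equation \(u_t + u u_x - \nu u_{xx} = 0\) as an example choice of \(\mathcal{N}[u]\) (the single-hidden-layer network and the data layout are illustrative):

```python
import jax.numpy as jnp
from jax import grad, vmap

def u_nn(theta, t, x):
    """Scalar surrogate u(t, x) ~ NN_theta(t, x); one hidden layer for illustration."""
    W1, b1, w2, b2 = theta
    h = jnp.tanh(W1 @ jnp.array([t, x]) + b1)
    return jnp.dot(w2, h) + b2

def pde_residual(theta, t, x, nu=0.01):
    """f = u_t + u * u_x - nu * u_xx  (Burgers' equation as N[u])."""
    u_t  = grad(u_nn, argnums=1)(theta, t, x)
    u_x  = grad(u_nn, argnums=2)(theta, t, x)
    u_xx = grad(grad(u_nn, argnums=2), argnums=2)(theta, t, x)
    u    = u_nn(theta, t, x)
    return u_t + u * u_x - nu * u_xx

def mse(theta, data, colloc):
    t_u, x_u, u_obs = data     # measured initial/boundary points
    t_f, x_f = colloc          # collocation points for the residual
    mse_u = jnp.mean((vmap(u_nn, (None, 0, 0))(theta, t_u, x_u) - u_obs) ** 2)
    mse_f = jnp.mean(vmap(pde_residual, (None, 0, 0))(theta, t_f, x_f) ** 2)
    return mse_u + mse_f
```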

What about PDE parameters?

\[ u_t + \mathcal{N}[u; \lambda] = 0, x \in \Omega, t \in [0, T] \]

\[ f := u_t + \mathcal{N}[u; \lambda] \]

\[ u(t, x) \approx NN_{\theta}(t, x) \]

with \(\lambda\) now treated as an extra trainable parameter, learned jointly with \(\theta\) from the data (the inverse problem).

And we need to go deeper!

\[ \mathcal{N}[u(t), u(\alpha(t)), W(t), U_\theta(u, \beta(t))] = 0 \]

where \(\alpha(t)\) is a delay function and \(W(t)\) is the Wiener process.

ODE meme

Conclusions

So What About Taking Advantage?

  • Publications
    • Journals
    • Conferences
  • Industry
    • System Identification
    • Data-Driven Optimal Control
  • Grants

How Can an Old-Fashioned Math Nerd Approach ML?

I don’t really know, but…

  • Optimization
  • Gradient Descent limit as a Differential Equation
  • Gaussian Processes
  • Backward Error Analysis
  • Neural Tangent Kernels

References